{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Learning Goals\n",
    "\n",
    "## Optimizing Foundation Models with Supervised Fine-Tuning (SFT)\n",
    "\n",
    "Often, we want to adapt or customize foundation models to be more performant on our specific task. Fine-tuning refers to how we can modify the weights of a pre-trained foundation model with additional custom data. Supervised Fine-Tuning (SFT) refers to unfreezing all the weights and layers in our model and training on a newly labeled set of examples. We can fine-tune to incorporate new, domain-specific knowledge, or teach the foundation model what type of response to provide. One specific type of SFT is also referred to as “instruction tuning” where we use SFT to teach a model to follow instructions better. In this tutorial, will demonstrate how to perform SFT with Llama3-8b using NeMo 2.0.\n",
    "\n",
    "NeMo 2.0 introduces Python-based configurations, PyTorch Lightning’s modular abstractions, and NeMo-Run for scaling experiments across multiple GPUs. In this notebook, we will use NeMo-Run to streamline the configuration and execution of our experiments.\n",
    "\n",
    "## Data\n",
    "Databricks-dolly-15k is an open-source dataset created by the collaborative efforts of Databricks employees. It consists of high-quality, human-generated prompt/response pairs specifically designed for instruction tuning LLMs. These pairs cover a diverse range of behaviors, from brainstorming and content generation to information extraction and summarization. \n",
    "\n",
    "For more information, refer to [databricks-dolly-15k | Hugging Face](https://huggingface.co/datasets/databricks/databricks-dolly-15k)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## Step 1. Import the Hugging Face Checkpoint\n",
    "We use the `llm.import_ckpt` API to download the specified model using the \"hf://<huggingface_model_id>\" URL format. It will then convert the model into NeMo 2.0 format. For all model supported in NeMo 2.0, refer to [Large Language Models](https://docs.nvidia.com/nemo-framework/user-guide/24.09/llms/index.html#large-language-models) section of NeMo Framework User Guide."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import nemo_run as run\n",
    "from nemo import lightning as nl\n",
    "from nemo.collections import llm\n",
    "from megatron.core.optimizer import OptimizerConfig\n",
    "import torch\n",
    "import lightning.pytorch as pl\n",
    "from pathlib import Path\n",
    "from nemo.collections.llm.recipes.precision.mixed_precision import bf16_mixed\n",
    "\n",
    "\n",
    "# llm.import_ckpt is the nemo2 API for converting Hugging Face checkpoint to NeMo format\n",
    "# example python usage:\n",
    "# llm.import_ckpt(model=llm.llama3_8b.model(), source=\"hf://meta-llama/Meta-Llama-3-8B\")\n",
    "#\n",
    "# We use run.Partial to configure this function\n",
    "def configure_checkpoint_conversion():\n",
    "    return run.Partial(\n",
    "        llm.import_ckpt,\n",
    "        model=llm.llama3_8b.model(),\n",
    "        source=\"hf://meta-llama/Meta-Llama-3-8B\",\n",
    "        overwrite=False,\n",
    "    )\n",
    "\n",
    "# configure your function\n",
    "import_ckpt = configure_checkpoint_conversion()\n",
    "# define your executor\n",
    "local_executor = run.LocalExecutor()\n",
    "\n",
    "# run your experiment\n",
    "run.run(import_ckpt, executor=local_executor)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2. Prepare the Data and Customize the DataModule\n",
    "\n",
    "We will be using Databricks-dolly-15k for this notebook. NeMo 2.0 already provides a `DollyDataModule`. For all data modules that are included in NeMo 2.0, refer to the [data module directory](https://github.com/NVIDIA/NeMo/tree/main/nemo/collections/llm/gpt/data). Example usage:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def dolly() -> run.Config[pl.LightningDataModule]:\n",
    "    return run.Config(llm.DollyDataModule, seq_length=2048, micro_batch_size=1, global_batch_size=8, num_workers=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To use your own data, you will need to create a custom `DataModule`. This involves extending the base class `FineTuningDataModule` so that you have access to existing data handling logic, such as packed sequences. Here we walk you through the process step by step, using the already existing [`DollyDataModule`](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/gpt/data/dolly.py) as an example. \n",
    "\n",
    "### Subclass the FineTuningDataModule\n",
    "You need to extend the `FineTuningDataModule` if you're fine-tuning NeMo models. This provides access to existing data handling logic, such as packed sequences. The `data_root` parameter is where you store your generated `train/validation/test.jsonl` in NeMo format. Below is how `DollyDataModule` does it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from typing import List, Optional\n",
    "from nemo.lightning.io.mixin import IOMixin\n",
    "from nemo.collections.llm.gpt.data.fine_tuning import FineTuningDataModule\n",
    "\n",
    "class DollyDataModule(FineTuningDataModule, IOMixin):\n",
    "    def __init__(\n",
    "        self,\n",
    "        seq_length: int = 2048,\n",
    "        tokenizer: Optional[\"TokenizerSpec\"] = None,\n",
    "        micro_batch_size: int = 4,\n",
    "        global_batch_size: int = 8,\n",
    "        rampup_batch_size: Optional[List[int]] = None,\n",
    "        force_redownload: bool = False,\n",
    "        delete_raw: bool = True,\n",
    "        seed: int = 1234,\n",
    "        memmap_workers: int = 1,\n",
    "        num_workers: int = 8,\n",
    "        pin_memory: bool = True,\n",
    "        persistent_workers: bool = False,\n",
    "        pad_to_max_length: bool = False,\n",
    "        packed_sequence_size: int = -1,\n",
    "    ):\n",
    "        self.force_redownload = force_redownload\n",
    "        self.delete_raw = delete_raw\n",
    "\n",
    "        super().__init__(\n",
    "            dataset_root=get_dataset_root(\"dolly\"),\n",
    "            seq_length=seq_length,\n",
    "            tokenizer=tokenizer,\n",
    "            micro_batch_size=micro_batch_size,\n",
    "            global_batch_size=global_batch_size,\n",
    "            rampup_batch_size=rampup_batch_size,\n",
    "            seed=seed,\n",
    "            memmap_workers=memmap_workers,\n",
    "            num_workers=num_workers,\n",
    "            pin_memory=pin_memory,\n",
    "            persistent_workers=persistent_workers,\n",
    "            pad_to_max_length=pad_to_max_length,\n",
    "            packed_sequence_size=packed_sequence_size,\n",
    "        )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Override the `prepare_data` Method\n",
    "\n",
    "The `prepare_data` method is responsible for downloading and preprocessing data if needed. If the dataset is already downloaded, you can skip this step.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def prepare_data(self) -> None:\n",
    "    # if train file is specified, no need to do anything\n",
    "    if not self.train_path.exists() or self.force_redownload:\n",
    "        dset = self._download_data()\n",
    "        self._preprocess_and_split_data(dset)\n",
    "    super().prepare_data()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Implement Data Download and Preprocessing Logic\n",
    "\n",
    "If your dataset requires downloading or preprocessing, implement this logic within the helper methods. Skip the download part if it's not needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def _download_data(self):\n",
    "    logging.info(f\"Downloading {self.__class__.__name__}...\")\n",
    "    return load_dataset(\n",
    "        \"databricks/databricks-dolly-15k\",\n",
    "        cache_dir=str(self.dataset_root),\n",
    "        download_mode=\"force_redownload\" if self.force_redownload else None,\n",
    "    )\n",
    "\n",
    "def _preprocess_and_split_data(self, dset, train_ratio: float = 0.80, val_ratio: float = 0.15):\n",
    "    logging.info(f\"Preprocessing {self.__class__.__name__} to jsonl format and splitting...\")\n",
    "\n",
    "    test_ratio = 1 - train_ratio - val_ratio\n",
    "    save_splits = {}\n",
    "    dataset = dset.get('train')\n",
    "    split_dataset = dataset.train_test_split(test_size=val_ratio + test_ratio, seed=self.seed)\n",
    "    split_dataset2 = split_dataset['test'].train_test_split(\n",
    "        test_size=test_ratio / (val_ratio + test_ratio), seed=self.seed\n",
    "    )\n",
    "    save_splits['training'] = split_dataset['train']\n",
    "    save_splits['validation'] = split_dataset2['train']\n",
    "    save_splits['test'] = split_dataset2['test']\n",
    "\n",
    "    for split_name, dataset in save_splits.items():\n",
    "        output_file = self.dataset_root / f\"{split_name}.jsonl\"\n",
    "        with output_file.open(\"w\", encoding=\"utf-8\") as f:\n",
    "            for example in dataset:\n",
    "                context = example[\"context\"].strip()\n",
    "                if context != \"\":\n",
    "                    # Randomize context and instruction order.\n",
    "                    context_first = np.random.randint(0, 2) == 0\n",
    "                    if context_first:\n",
    "                        instruction = example[\"instruction\"].strip()\n",
    "                        assert instruction != \"\"\n",
    "                        _input = f\"{context}\\n\\n{instruction}\"\n",
    "                        _output = example[\"response\"]\n",
    "                    else:\n",
    "                        instruction = example[\"instruction\"].strip()\n",
    "                        assert instruction != \"\"\n",
    "                        _input = f\"{instruction}\\n\\n{context}\"\n",
    "                        _output = example[\"response\"]\n",
    "                else:\n",
    "                    _input = example[\"instruction\"]\n",
    "                    _output = example[\"response\"]\n",
    "\n",
    "                f.write(json.dumps({\"input\": _input, \"output\": _output, \"category\": example[\"category\"]}) + \"\\n\")\n",
    "\n",
    "        logging.info(f\"{split_name} split saved to {output_file}\")\n",
    "\n",
    "    if self.delete_raw:\n",
    "        for p in self.dataset_root.iterdir():\n",
    "            if p.is_dir():\n",
    "                shutil.rmtree(p)\n",
    "            elif '.jsonl' not in str(p.name):\n",
    "                p.unlink()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The original example in Dolly dataset looks like:\n",
    "```\n",
    "{'instruction': 'Extract all the movies from this passage and the year they were released out. Write each movie as a separate sentence', 'context': \"The genre has existed since the early years of silent cinema, when Georges Melies' A Trip to the Moon (1902) employed trick photography effects. The next major example (first in feature length in the genre) was the film Metropolis (1927). From the 1930s to the 1950s, the genre consisted mainly of low-budget B movies. After Stanley Kubrick's landmark 2001: A Space Odyssey (1968), the science fiction film genre was taken more seriously. In the late 1970s, big-budget science fiction films filled with special effects became popular with audiences after the success of Star Wars (1977) and paved the way for the blockbuster hits of subsequent decades.\", 'response': 'A Trip to the Moon was released in 1902. Metropolis came out in 1927. 2001: A Space Odyssey was released in 1968. Star Wars came out in 1977.', 'category': 'information_extraction'}\n",
    "```\n",
    "After the preprocessing logic, the data examples are transformed into NeMo format, as below:\n",
    "```\n",
    "{'input': \"Extract all the movies from this passage and the year they were released out. Write each movie as a separate sentence\\n\\nThe genre has existed since the early years of silent cinema, when Georges Melies' A Trip to the Moon (1902) employed trick photography effects. The next major example (first in feature length in the genre) was the film Metropolis (1927). From the 1930s to the 1950s, the genre consisted mainly of low-budget B movies. After Stanley Kubrick's landmark 2001: A Space Odyssey (1968), the science fiction film genre was taken more seriously. In the late 1970s, big-budget science fiction films filled with special effects became popular with audiences after the success of Star Wars (1977) and paved the way for the blockbuster hits of subsequent decades.\", 'output': 'A Trip to the Moon was released in 1902. Metropolis came out in 1927. 2001: A Space Odyssey was released in 1968. Star Wars came out in 1977.', 'category': 'information_extraction'}\n",
    "```\n",
    "Each data example is saved as a json string as one line in the `train/validation/test.jsonl` file, under `data_root` directory you specified earlier."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3.1: Configure SFT with the NeMo 2.0 API \n",
    "\n",
    "In this notebook we use NeMo 2.0 API to perform SFT. First we configure the following components for training. These components are similar between SFT and PEFT. SFT and PEFT both uses `llm.finetune` API. To switch from PEFT to SFT, you just need to remove the `peft` parameter.\n",
    "\n",
    "### Configure the Trainer\n",
    "The NeMo 2.0 Trainer works similarly to the PyTorch Lightning trainer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def trainer() -> run.Config[nl.Trainer]:\n",
    "    strategy = run.Config(\n",
    "        nl.MegatronStrategy,\n",
    "        tensor_model_parallel_size=2\n",
    "    )\n",
    "    trainer = run.Config(\n",
    "        nl.Trainer,\n",
    "        devices=2,\n",
    "        max_steps=20,\n",
    "        accelerator=\"gpu\",\n",
    "        strategy=strategy,\n",
    "        plugins=bf16_mixed(),\n",
    "        log_every_n_steps=1,\n",
    "        limit_val_batches=2,\n",
    "        val_check_interval=2,\n",
    "        num_sanity_val_steps=0,\n",
    "    )\n",
    "    return trainer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### Configure the Logger\n",
    "Configure your training steps, output directories and logging through `NeMoLogger`. In the following example, the experiment output will be saved at `./results/nemo2_sft`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def logger() -> run.Config[nl.NeMoLogger]:\n",
    "    ckpt = run.Config(\n",
    "        nl.ModelCheckpoint,\n",
    "        save_last=True,\n",
    "        every_n_train_steps=10,\n",
    "        monitor=\"reduced_train_loss\",\n",
    "        save_top_k=1,\n",
    "        save_on_train_epoch_end=True,\n",
    "        save_optim_on_train_end=True,\n",
    "    )\n",
    "\n",
    "    return run.Config(\n",
    "        nl.NeMoLogger,\n",
    "        name=\"nemo2_sft\",\n",
    "        log_dir=\"./results\",\n",
    "        use_datetime_version=False,\n",
    "        ckpt=ckpt,\n",
    "        wandb=None\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "### Configure the Optimizer\n",
    "In the following example, we will be using the distributed adam optimizer and pass in the optimizer configuration through `OptimizerConfig`: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def adam_with_cosine_annealing() -> run.Config[nl.OptimizerModule]:\n",
    "    opt_cfg = run.Config(\n",
    "        OptimizerConfig,\n",
    "        optimizer=\"adam\",\n",
    "        lr=5e-6,\n",
    "        adam_beta2=0.98,\n",
    "        use_distributed_optimizer=True,\n",
    "        clip_grad=1.0,\n",
    "        bf16=True,\n",
    "    )\n",
    "    return run.Config(\n",
    "        nl.MegatronOptimizerModule,\n",
    "        config=opt_cfg\n",
    "    )\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Configure the Base Model\n",
    "We will perform SFT on top of Llama3-8B, so we create a `LlamaModel` to pass to the finetune API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def llama3_8b() -> run.Config[pl.LightningModule]:\n",
    "    return run.Config(llm.LlamaModel, config=run.Config(llm.Llama3Config8B))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Auto Resume\n",
    "In NeMo 2.0, we can directly pass in the Llama3-8b Hugging Face ID to start SFT without manually converting it into the NeMo checkpoint format, as required in NeMo 1.0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def resume() -> run.Config[nl.AutoResume]:\n",
    "    return run.Config(\n",
    "        nl.AutoResume,\n",
    "        restore_config=run.Config(nl.RestoreConfig,\n",
    "            path=\"nemo://meta-llama/Meta-Llama-3-8B\"\n",
    "        ),\n",
    "        resume_if_exists=True,\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### Configure NeMo 2.0 finetune API\n",
    "Using all the components we created above, we can call the NeMo 2.0 finetune API. The python example usage is as below:\n",
    "```\n",
    "llm.finetune(\n",
    "    model=llama3_8b(),\n",
    "    data=dolly(),\n",
    "    trainer=trainer(),\n",
    "    log=logger(),\n",
    "    optim=adam_with_cosine_annealing(),\n",
    "    resume=resume(),\n",
    ")\n",
    "```\n",
    "We configure the `llm.finetune` API as below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def configure_finetuning_recipe():\n",
    "    return run.Partial(\n",
    "        llm.finetune,\n",
    "        model=llama3_8b(),\n",
    "        trainer=trainer(),\n",
    "        data=dolly(),\n",
    "        log=logger(),\n",
    "        optim=adam_with_cosine_annealing(),\n",
    "        resume=resume(),\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3.2: Run SFT with NeMo 2.0 API and NeMo-Run\n",
    "\n",
    "We use `LocalExecutor` for executing our configured finetune function. For more details on the NeMo-Run executor, refer to [Execute NeMo Run](https://github.com/NVIDIA/NeMo-Run/blob/main/docs/source/guides/execution.md) of NeMo-Run Guides. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def local_executor_torchrun(nodes: int = 1, devices: int = 2) -> run.LocalExecutor:\n",
    "    # Env vars for jobs are configured here\n",
    "    env_vars = {\n",
    "        \"TORCH_NCCL_AVOID_RECORD_STREAMS\": \"1\",\n",
    "        \"NCCL_NVLS_ENABLE\": \"0\",\n",
    "    }\n",
    "\n",
    "    executor = run.LocalExecutor(ntasks_per_node=devices, launcher=\"torchrun\", env_vars=env_vars)\n",
    "\n",
    "    return executor\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    run.run(configure_finetuning_recipe(), executor=local_executor_torchrun())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4. Generate Results from Trained SFT Checkpoints\n",
    "\n",
    "We use the `llm.generate` API in NeMo 2.0 to generate results from the trained SFT checkpoint. Find your last saved checkpoint from your experiment dir: `results/nemo2_sft/checkpoints`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "sft_ckpt_path=str(next((d for d in Path(\"./results/nemo2_sft/checkpoints/\").iterdir() if d.is_dir() and d.name.endswith(\"-last\")), None))\n",
    "print(\"We will load SFT checkpoint from:\", sft_ckpt_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When using `llm.generate` API, you can pass a data module such as dolly: `input_dataset=dolly()`. This will use the test set from the specified data module to generate predictions. In the following example, the generated predictions are saved to the `sft_predictions.txt` file. Note that while fine-tuning required a minimum of 2 GPUs with `tensor_model_parallel_size=2`, generating predictions only requires `tensor_model_parallel_size=1`. However, using multiple GPUs can speed up the inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from megatron.core.inference.common_inference_params import CommonInferenceParams\n",
    "\n",
    "\n",
    "def trainer() -> run.Config[nl.Trainer]:\n",
    "    strategy = run.Config(\n",
    "        nl.MegatronStrategy,\n",
    "        tensor_model_parallel_size=1,\n",
    "    )\n",
    "    trainer = run.Config(\n",
    "        nl.Trainer,\n",
    "        accelerator=\"gpu\",\n",
    "        devices=1,\n",
    "        num_nodes=1,\n",
    "        strategy=strategy,\n",
    "        plugins=bf16_mixed(),\n",
    "    )\n",
    "    return trainer\n",
    "\n",
    "def configure_inference():\n",
    "    return run.Partial(\n",
    "        llm.generate,\n",
    "        path=str(sft_ckpt_path),\n",
    "        trainer=trainer(),\n",
    "        input_dataset=dolly(),\n",
    "        inference_params=CommonInferenceParams(num_tokens_to_generate=20, top_k=1),\n",
    "        output_path=\"sft_prediction.jsonl\",\n",
    "    )\n",
    "\n",
    "\n",
    "def local_executor_torchrun(nodes: int = 1, devices: int = 1) -> run.LocalExecutor:\n",
    "    # Env vars for jobs are configured here\n",
    "    env_vars = {\n",
    "        \"TORCH_NCCL_AVOID_RECORD_STREAMS\": \"1\",\n",
    "        \"NCCL_NVLS_ENABLE\": \"0\",\n",
    "    }\n",
    "\n",
    "    executor = run.LocalExecutor(ntasks_per_node=devices, launcher=\"torchrun\", env_vars=env_vars)\n",
    "\n",
    "    return executor\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    run.run(configure_inference(), executor=local_executor_torchrun())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After the inference is complete, you will see results similar to the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "head -n 3 sft_prediction.jsonl"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should see output similar to the following:\n",
    "```\n",
    "{\"input\": \"What is best creator's platform\", \"category\": \"brainstorming\", \"label\": \"Youtube. Youtube should be best creator platform\", \"prediction\": \" for video content creators. YouTube is best creator's platform for video content creators.\"}\n",
    "{\"input\": \"When was the last time the Raiders won the Super Bowl?\", \"category\": \"open_qa\", \"label\": \"The Raiders have won three Super Bowl championships (1977, 1981, and 1984), one American Football League (AFL) championship (1967), and four American Football Conference (AFC) titles. The most recent Super Bowl ring was won in 1984 against the Washington Redskins of the NFC.\", \"prediction\": \" 2003\"}\n",
    "{\"input\": \"Muckle Water is a long, narrow fresh water loch on Ward Hill on Rousay, Orkney, Scotland. It is the biggest loch on the island and is popular for fishing. It can be reached by a track from the roadside. The Suso Burn on the north eastern shore drains the loch into the Sound of Rousay.\\n\\nWhere is Muckle Water?\", \"category\": \"closed_qa\", \"label\": \"Muckle water is located in Rousay, Orkney, Scotland.\", \"prediction\": \" Muckle Water is a long, narrow fresh water loch on Ward Hill on Rousay,\"}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5. Calculate Evaluation Metrics\n",
    "\n",
    "We can evaluate the model's predictions by calculating the Exact Match (EM) and F1 scores.\n",
    "- Exact Match is a binary measure (0 or 1) checking if the model outputs match one of the\n",
    "ground truth answer exactly.\n",
    "- F1 score is the harmonic mean of precision and recall for the answer words.\n",
    "\n",
    "Below is a script that computes these metrics. The sample scores can be improved by training the model further and performing hyperparameter tuning. In this notebook, we only train for 20 steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "!python /opt/NeMo/scripts/metric_calculation/peft_metric_calc.py --pred_file sft_prediction.jsonl --label_field \"label\" --pred_field \"prediction\""
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
